Software Analysis with Unsupervised Topic Models
Abstract
We provide an overview of our work in applying unsupervised topic and author-topic models based on Latent Dirichlet Allocation (LDA) to the problem of mining large software repositories at multiple levels of granularity. Our approaches allow us to automatically discover the topics embedded in code and extract document-topic and author-topic distributions. In addition to serving as a convenient summary for program content and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing software complexity, developer similarity, and the evolution of software over the release timeline.
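As a rough illustration of how topic distributions can support information-theoretic comparisons, the sketch below computes the Shannon entropy of a document-topic distribution and the Jensen-Shannon divergence between two author-topic distributions. The abstract does not say which measures the authors use, so both measures are assumptions chosen for illustration. The file names, author names, and probability values are hypothetical and do not come from a fitted LDA model.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical document-topic distributions over three topics.
file_topics = {
    "Parser.java": [0.90, 0.05, 0.05],  # concentrated on one topic
    "Util.java": [0.34, 0.33, 0.33],    # spread across all topics
}

# Hypothetical author-topic distributions over the same three topics.
author_topics = {
    "alice": [0.7, 0.2, 0.1],
    "bob": [0.6, 0.3, 0.1],
    "carol": [0.1, 0.1, 0.8],
}

if __name__ == "__main__":
    for name, dist in file_topics.items():
        print(f"{name}: entropy = {entropy(dist):.3f} bits")
    for other in ("bob", "carol"):
        d = js_divergence(author_topics["alice"], author_topics[other])
        print(f"JS(alice, {other}) = {d:.3f} bits")
```

Under these assumptions, a file whose topic mass is spread over many topics has higher entropy than one focused on a single topic, and developers with similar topic profiles have a smaller divergence than developers who work on different topics.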
Related resources
Traffic Scene Analysis using Hierarchical Sparse Topical Coding
Analyzing motion patterns in traffic videos can be exploited directly to generate high-level descriptions of the video contents. Such descriptions may further be employed in different traffic applications such as traffic phase detection and abnormal event detection. One of the most recent and successful unsupervised methods for complex traffic scene analysis is based on topic models. In this pa...
Is Your Anchor Going Up or Down? Fast and Accurate Supervised Topic Models
Topic models provide insights into document collections, and their supervised extensions also capture associated document-level metadata such as sentiment. However, inferring such models from data is often slow and cannot scale to big data. We build upon the “anchor” method for learning topic models to capture the relationship between metadata and latent topics by extending the vector-space rep...
Unsupervised Modeling, Detection and Localization of Anomalies in Surveillance Videos
Most techniques today focus either on trajectory clustering or capturing intrinsic scene features to detect and identify the abnormal content in videos. On lines similar to the latter paradigm, we model the usual and dominant behavior of videos using unsupervised probabilistic topic models, as complement of which we identify the “anomalous” ones. Through this paper, we make the following contri...
Mining Internet-Scale Software Repositories
Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop an infrastructure for the automated crawling, parsing, and database storage of open source software. The infrastructure allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling...
Mining Internet-Scale Software Repositories
Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 Java projects from SourceForge and Apache totali...
Publication date: 2009